[5.7] Character class and scalar coalescing fixes #588

hamishknight · 2022-07-19T13:51:45Z

5.7 cherry-pick of #570 + #574 + #559

Fixes character class range matching:
- Resolves "Digit matching behaving as intended?" (#401)
- Resolves "Case insensitivity behavior of character class ranges differs from PCRE" (#395)
Fixes scalar coalescing:
- Resolves "Adjacent scalars should always be coalesced" (#572)
- Resolves "Better preserve scalar syntax in DSL transform" (#573)
Fixes scalar semantic matching of quoted sequences in character classes:
- Resolves "Quoted character class sequences don't match in scalar mode" (#586)

Resolves rdar://97386173

hamishknight · 2022-07-21T18:59:52Z

Explanation: Fixes the matching behavior of custom character classes. This includes properly coalescing adjacent scalars to produce correct grapheme semantics, fixing ranges to exhibit more consistent matching behavior, and fixing the scalar semantic behavior of quoted sequences. The scalar coalescing fixes also apply to the DSL, such that adjacent scalars in a DSL concatenation are coalesced to produce correct grapheme matching semantics.
Scope: Affects runtime matching semantics of custom character classes in regex literals, as well as adjacent Unicode scalar values in the DSL.
Radar: rdar://97386173
Risk: Low/Medium. This change reworks a decent amount of matching engine logic around character classes; however, many test cases have been added to ensure correctness. Most of the other changes are generalizations of existing logic.
Testing: Added test cases to the repo to ensure the matching behavior is correct.
Reviewer: @stephentyrone

Previously we performed a lexicographic comparison with the bounds of a character class range. However, this produced surprising results, and our implementation didn't properly handle case sensitivity. Update the logic to instead only allow single-scalar NFC bounds. In grapheme semantic mode, the input is converted to NFC and then checked against the range. In scalar semantic mode, the input scalar is checked on its own. Additionally, fix the case sensitivity handling such that we check both the lowercase and uppercase version of the input against the range.
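The range check described above can be sketched roughly as follows. This is only an illustration, not the engine's code: `scalarInRange` and `characterInRange` are hypothetical names, NFC conversion is done through Foundation, and the sketch assumes that a character whose NFC form is more than one scalar never matches a range.

```swift
import Foundation

// Hypothetical helper: scalar semantic mode checks the input scalar on its own.
func scalarInRange(
  _ input: Unicode.Scalar,
  _ range: ClosedRange<Unicode.Scalar>,
  caseInsensitive: Bool
) -> Bool {
  if range.contains(input) { return true }
  guard caseInsensitive else { return false }
  // Check both the lowercase and uppercase version of the input,
  // considering only mappings that are a single scalar.
  let props = input.properties
  for mapping in [props.lowercaseMapping, props.uppercaseMapping] {
    let scalars = mapping.unicodeScalars
    if scalars.count == 1, let mapped = scalars.first, range.contains(mapped) {
      return true
    }
  }
  return false
}

// Hypothetical helper: grapheme semantic mode converts the input to NFC first.
// Assumption: a multi-scalar NFC form cannot fall inside a single-scalar range.
func characterInRange(
  _ input: Character,
  _ range: ClosedRange<Unicode.Scalar>,
  caseInsensitive: Bool
) -> Bool {
  let nfc = String(input).precomposedStringWithCanonicalMapping.unicodeScalars
  guard nfc.count == 1, let scalar = nfc.first else { return false }
  return scalarInRange(scalar, range, caseInsensitive: caseInsensitive)
}
```

For example, `characterInRange("e\u{301}", "\u{E0}"..."\u{FF}", caseInsensitive: false)` normalizes the two-scalar input to U+00E9 before comparing it against the bounds.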

Previously we would emit a series of scalars written in the DSL as a series of individual characters in grapheme semantic mode. Change the behavior such that we coalesce any adjacent scalars and characters, including those in regex literals and nested concatenations. We then perform grapheme breaking over the result, and can emit character matches for scalars that coalesced into a grapheme. This transform subsumes a similar transform we performed for regex literals when converting them to a DSLTree. This has the nice side effect of allowing us to better preserve scalar syntax in the DSL transform. rdar://96942688
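As a rough sketch of the coalescing idea, assuming a much simpler tree than the real DSLTree: the `Node` enum and `coalesceAdjacent` function below are hypothetical, and the sketch emits every coalesced run as character nodes rather than preserving scalar syntax.

```swift
// Hypothetical, simplified stand-in for a concatenation's children.
enum Node: Equatable {
  case scalar(Unicode.Scalar)
  case character(Character)
  case other(String) // any node that cannot be coalesced
}

// Coalesce adjacent scalars and characters, then grapheme-break the result.
func coalesceAdjacent(_ nodes: [Node]) -> [Node] {
  var result: [Node] = []
  var pending = ""

  // Iterating a String yields Characters, which performs the grapheme breaking.
  func flush() {
    result.append(contentsOf: pending.map(Node.character))
    pending = ""
  }

  for node in nodes {
    switch node {
    case .scalar(let scalar):
      pending.unicodeScalars.append(scalar)
    case .character(let character):
      pending.unicodeScalars.append(contentsOf: character.unicodeScalars)
    case .other:
      // A non-coalescable node ends the current run.
      flush()
      result.append(node)
    }
  }
  flush()
  return result
}
```

With this sketch, `[.scalar("e"), .scalar("\u{301}")]` becomes a single `.character` node for the combined grapheme, instead of two separate single-scalar characters.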

In grapheme semantic mode, coalesce adjacent character and scalar members of a custom character class, over which we can perform grapheme breaking. This involves potentially re-writing ranges such that they contain a complete grapheme of adjacent scalars.

hamishknight · 2022-07-21T23:05:59Z

@swift-ci please test

hamishknight added the r5.7 5.7 Release Cherry Picks label Jul 19, 2022

hamishknight requested a review from stephentyrone July 19, 2022 13:52

hamishknight mentioned this pull request Jul 19, 2022

[5.7] [DNM] Null PR swiftlang/swift#42532

Closed

hamishknight force-pushed the more-fixes-5.7 branch from 29df207 to b154485 Compare July 20, 2022 20:27

stephentyrone approved these changes Jul 21, 2022

hamishknight mentioned this pull request Jul 21, 2022

[5.7] [DNM] Cherry-pick batch test PR #443

Closed

hamishknight force-pushed the more-fixes-5.7 branch from b154485 to 4d89b8d Compare July 21, 2022 22:29

hamishknight added 10 commits July 22, 2022 00:05

Validate optimizations when a match fails

8b844fd

This allows us to catch the case where a match occurs without optimizations, but doesn't occur with optimizations. Additionally fix the `xfail` param such that it can't be used on tests that actually match expectations.
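A minimal sketch of what such a check might look like, assuming a test helper that can run the same match with and without optimizations. `checkMatch`, `CheckFailure`, and the closure parameters are hypothetical and are not the repository's actual test utilities.

```swift
// Hypothetical error type for the sketch.
struct CheckFailure: Error {
  let message: String
}

// Hypothetical helper: `optimized` and `unoptimized` run the same regex against
// the same input and return the matched text, or nil when there is no match.
func checkMatch(
  optimized: () -> String?,
  unoptimized: () -> String?,
  expected: String?,
  xfail: Bool = false
) throws {
  let fast = optimized()
  let slow = unoptimized()

  // Compare both paths even when the match fails, so that a match that only
  // occurs without optimizations is caught.
  guard fast == slow else {
    throw CheckFailure(message: "optimized and unoptimized results differ")
  }

  let metExpectation = (fast == expected)
  if xfail && metExpectation {
    // An expected failure that actually matches expectations is itself an error.
    throw CheckFailure(message: "xfail test unexpectedly met expectations")
  }
  if !xfail && !metExpectation {
    throw CheckFailure(message: "result did not meet expectations")
  }
}
```

The design point is that `xfail` does not suppress checking; it inverts the expectation, so a stale `xfail` marker fails loudly.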

Guard against testing with older stdlibs

149d1ba

Replace a couple of `#if os(Linux)` checks with a check to see if we have a newer stdlib available. This lets us emit an expected failure in the case where we're testing on an older stdlib.

Add some extra character class newline matching tests

ae63fb5

Fix scalar mode for quoted sequences in character class

12fcb52

Previously we would only match entire characters. Update to use the generic Character consumer logic that can handle scalar semantic mode. rdar://97209131
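To illustrate the difference between the two modes, here is a simplified sketch. `Semantics` and `quotedSequenceMatches` are hypothetical names, and the sketch only looks at the first element of the input rather than modeling the engine's consumer logic.

```swift
// Hypothetical stand-in for the regex's matching semantics.
enum Semantics {
  case graphemeCluster
  case unicodeScalar
}

// Hypothetical check: does a quoted sequence inside a character class accept
// the first element of `input` under the given semantics?
func quotedSequenceMatches(
  _ quoted: String,
  input: String,
  semantics: Semantics
) -> Bool {
  switch semantics {
  case .graphemeCluster:
    // Compare whole characters.
    guard let first = input.first else { return false }
    return quoted.contains(first)
  case .unicodeScalar:
    // Compare individual scalars, so part of a grapheme can match.
    guard let first = input.unicodeScalars.first else { return false }
    return quoted.unicodeScalars.contains(first)
  }
}
```

Given the input `"e\u{301}"` and the quoted sequence `"e"`, the grapheme check fails because the whole character differs, while the scalar check succeeds on the leading `e`.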

Form ASCII bitsets for quoted sequences in character classes

fdf04b9

Throw RegexCompilationError for invalid character class bounds

f64d020

Make sure we throw the right error for ranges that are invalid in grapheme mode, but are valid in scalar mode.

Allow coalescing through trivia

7dbe453

I also noticed that `lexQuantifier` could silently eat trivia if it failed to lex a quantification, so also fix that.

hamishknight force-pushed the more-fixes-5.7 branch from 4d89b8d to 7dbe453 Compare July 21, 2022 23:05

hamishknight merged commit 4eb3233 into swiftlang:swift/release/5.7 Jul 21, 2022

hamishknight deleted the more-fixes-5.7 branch July 21, 2022 23:21
